# MODULES
import numpy as np
import folium
from folium import FeatureGroup, LayerControl, Map, Marker, CircleMarker
import requests
from geopy.geocoders import Nominatim
from pyproj import CRS
import pyproj
import json
import os
import pandas as pd
from IPython.display import display, HTML
from sklearn.cluster import DBSCAN
%matplotlib inline
import matplotlib.pyplot as plt
np.set_printoptions(precision=3,suppress=True)
The goal is to find a promising location to promote and sell new artisan ice cream. The constraint is that this location should be within the greater Rochester area in upstate New York to keep the owners' work commute within reasonable bounds. The target audience of this project is the owners or any other locals interested in the ice cream business.
The challenge is that there are already a number of well-established ice cream parlors. Even though we expect that the new owners would not shy away from competition by simply setting up business in popular locations, it may be prudent to examine how many ice cream shops particular locations can support, what makes them special, and whether there are locations with similar characteristics that the ice cream business has not yet discovered.
This project aims to answer some of these questions to give the new ice cream shop owners the best shot of establishing a successful business.
# Parameters
Load_from_files = True # if set True, data is loaded from files from earlier sessions rather ...
# ... than calling the functions/methods/APIs (e.g., Foursquare) generating them
There are two locations on which the analysis is based: Rochester, NY, and Fairport, NY.
Their coordinates, obtained from 'geopy' and converted through 'pyproj' into a distance, are:
if Load_from_files:
    PointsOfInt = np.load('DataFiles\\PointsOfInt.npy', allow_pickle=True).item()
    for i in PointsOfInt.keys():
        print(i, ': ', PointsOfInt[i]['point'])
else:
    PointsOfInt = {'Rochester': {'address': 'Rochester, NY', 'point': None},
                   'Fairport': {'address': 'Fairport, NY', 'point': None}}
    geolocator = Nominatim(user_agent="Monroe_explorer")
    for PoI in PointsOfInt.items():
        location = geolocator.geocode(PoI[1]['address'])
        PoI[1]['point'] = location.point[:2]
        print(PoI[0], ': ', PoI[1]['point'], sep='')
geod_wgs84 = CRS("epsg:4326").get_geod()
az12, az21, dist = geod_wgs84.inv(PointsOfInt['Rochester']['point'][1] , PointsOfInt['Rochester']['point'][0],
PointsOfInt['Fairport']['point'][1], PointsOfInt['Fairport']['point'][0])
print(f'Distance between them:{dist*0.62137/1000:4.1f} mi')
An extensive search for data, such as statistics on towns and neighborhoods, in databases on Monroe County (or the greater Rochester area) was conducted, leading to sites such as the NYU Spatial Data Repository, Monroe County GIS, the NYS GIS Clearinghouse, etc. Instead of collecting sparse data, converting the different formats, parsing data from individual web pages, etc., it was found that retrieving venue data directly from Foursquare and then processing it is more effective.
For the queries the more general 'explore' endpoint was used rather than 'search' to be more inclusive. Multiple queries were conducted with terms that may serve as surrogates for 'ice cream'.
The relevant data was then cast into Pandas dataframes. The 'Categories' column contains the category name variants (name, plural name, short name) from the JSON files generated by the queries to catch as much as possible. They were concatenated into a string to be used in later filtering.
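As an aside, nested JSON like the Foursquare response can also be flattened with `pandas.json_normalize`; below is a sketch on a mocked-up response (the venue data is invented for illustration):

```python
import pandas as pd

# Mock of the nested structure returned by the 'explore' endpoint (values invented)
results = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Sample Scoop',
               'location': {'address': '1 Main St', 'city': 'Fairport',
                            'lat': 43.1, 'lng': -77.44},
               'categories': [{'name': 'Ice Cream Shop',
                               'pluralName': 'Ice Cream Shops',
                               'shortName': 'Ice Cream'}]}}]}]}}

items = results['response']['groups'][0]['items']
df = pd.json_normalize(items)
# Nested keys become dotted column names, e.g. 'venue.name', 'venue.location.lat'
print(df[['venue.name', 'venue.location.city', 'venue.location.lat']])
```

This produces the same flat table in one call, though list-valued fields such as 'categories' still need manual handling, which is why the explicit loop below is used.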
# FourSquare ID
CLIENT_ID = 'xxxxxxxxx' # Foursquare ID
CLIENT_SECRET = 'xxxxx' # Foursquare Secret
VERSION = '20180604'
LIMIT = 300
# Query Parameters
Location = 'Rochester'
queries = ['Ice Cream','Coffee','Latte', 'Cafe','Beer', 'Breweries', 'Pubs']
radius = dist+1000
display(HTML(f"<h3>Query Scores:</h3>"))
QueryResults = []
if Load_from_files:
    if os.path.exists('JSONfiles'):
        for query in queries:
            try:
                with open('JSONfiles\\'+query+'.json') as json_file:
                    results = json.load(json_file)
            except FileNotFoundError:
                print(f"File for '{query}' missing")
                continue
            QueryResults.append(results)
            print(f'Query {query} produced {len(results["response"]["groups"][0]["items"])} hits.')
    else:
        print('JSON file folder missing')
else:
    for query in queries:
        url = ('https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'
               .format(CLIENT_ID, CLIENT_SECRET, PointsOfInt[Location]['point'][0], PointsOfInt[Location]['point'][1],
                       VERSION, query, radius, LIMIT))
        results = requests.get(url).json()
        with open('JSONfiles\\'+query+'.json', 'w') as outfile:
            json.dump(results, outfile)
        QueryResults.append(results)
        print(f'Query {query} produced {len(results["response"]["groups"][0]["items"])} hits.')
# Build dataframes
df_queries = dict.fromkeys(queries)
for i, query in enumerate(queries):
    df_queries[query] = pd.DataFrame(columns=['Name', 'Address', 'City', 'Lat', 'Lng', 'Categories'])
    for j, item in enumerate(QueryResults[i]['response']['groups'][0]['items']):
        df_queries[query].loc[j, 'Name'] = item['venue']['name']
        venue_loc = item['venue']['location']  # local name; avoids shadowing the query parameter 'Location'
        if 'address' in venue_loc: df_queries[query].loc[j, 'Address'] = venue_loc['address']
        if 'city' in venue_loc: df_queries[query].loc[j, 'City'] = venue_loc['city']
        df_queries[query].loc[j, 'Lat'] = venue_loc['lat']
        df_queries[query].loc[j, 'Lng'] = venue_loc['lng']
        cat = item['venue']['categories'][0]
        df_queries[query].loc[j, 'Categories'] = ', '.join([cat[key] for key in ['name', 'pluralName', 'shortName']])
display(HTML(f"<h3>First few lines of dataframes from the queries:</h3>"))
for query in queries:
    display(HTML(f"<h4>Query '{query}':</h4>"))
    display(HTML(df_queries[query].head(3).to_html()))
This section combines the dataframes, examines the unique terms, and then winnows them down to filter keys. The filters produced two general venue groups that may serve as surrogates and one that is associated with ice cream itself. Cluster analysis and possible correlations between them are examined to explore potential new locations for ice cream parlors.
The descriptive headings/labels for the groups are 'Ice Cream', 'Café', and 'Beer'.
Please go to analysis to learn in more detail how this methodology is applied.
'Ice Cream' is used here very loosely. The idea is to look for places that could compete with ice cream parlors.
display(HTML(f"<h4>Unique Categories for 'Ice Cream':</h4>"))
display(pd.Series(df_queries['Ice Cream']['Categories'].unique()))
filter_IceCream = ['Ice Cream Shop', 'Ice Cream Shops', 'Ice Cream',
'Frozen Yogurt Shop', 'Frozen Yogurt Shops', 'Yogurt',
'Dessert Shop', 'Dessert Shops', 'Desserts']
display(HTML(f"<h4>Filter Set:</h4>"))
for item in filter_IceCream[:-1]: print(item, end=', ')
print(filter_IceCream[-1])
filter_IceCream = set(filter_IceCream)
# FILTERING of 'Ice Cream'
# To filter the dataframe I split the strings in the 'Categories' column on ', ' into a list and ...
# ... then convert the list into a set. I compare this set with the filter set. If the intersection ...
# ... of both sets is not empty, then I accept the row.
idx = []
for i in range(len(df_queries['Ice Cream'])):
    idx.append(filter_IceCream & set(df_queries['Ice Cream'].loc[i, 'Categories'].split(', ')) != set())
df_IceCream = df_queries['Ice Cream'][idx]
df_IceCream.reset_index(drop=True, inplace=True)
display(HTML(f"<Large>The filtered dataframe 'Ice Cream' now has {len(df_IceCream)} entries that represent locations "+
             f'that serve ice cream or edibles closely related to ice cream.</Large>'))
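For reference, the same intersection test can be written as a vectorized pandas mask instead of an explicit loop; a small self-contained sketch (toy data, not the actual query results):

```python
import pandas as pd

filter_IceCream = {'Ice Cream Shop', 'Dessert Shop'}
df = pd.DataFrame({'Name': ['Sweet Stop', 'Slice Place'],
                   'Categories': ['Ice Cream Shop, Ice Cream Shops, Ice Cream',
                                  'Pizza Place, Pizza Places, Pizza']})

# Keep rows whose category set intersects the filter set
mask = df['Categories'].apply(lambda s: bool(filter_IceCream & set(s.split(', '))))
df_filtered = df[mask].reset_index(drop=True)
print(df_filtered['Name'].tolist())  # ['Sweet Stop']
```

`Series.apply` builds the boolean index in one pass, which avoids the manual `idx` list.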
# Concatenate dataframes
df_Cafe = pd.concat([df_queries[query] for query in queries[1:4]])
df_Cafe.reset_index(drop=True, inplace=True)
The dataframes 'Coffee', 'Latte', and 'Cafe' are combined into one dataframe and a similar methodology as for 'Ice Cream' is applied.
display(HTML(f"<h4>Unique Categories for 'Caf\u00e9':</h4>"))
display(pd.Series(df_Cafe['Categories'].unique()))
filter_Cafe= ['Coffee Shop', 'Coffee Shops', 'Coffee Shop',
'Café', 'Cafés', 'Café',
'Tea Room', 'Tea Rooms', 'Tea Room']
display(HTML(f"<h4>Filter Set:</h4>"))
for item in filter_Cafe[:-1]: print(item, end=', ')
print(filter_Cafe[-1])
filter_Cafe = set(filter_Cafe)
# FILTERING of 'Coffee places'
idx = []
N_prev = len(df_Cafe)
for i in range(len(df_Cafe)):
    idx.append(filter_Cafe & set(df_Cafe.loc[i, 'Categories'].split(', ')) != set())
df_Cafe = df_Cafe[idx]
df_Cafe.reset_index(drop=True, inplace=True)
display(HTML(f"<Large>The filtered dataframe 'Caf\u00e9' now has {len(df_Cafe)} entries instead of {N_prev}.</Large>"))
The same methodology as above is applied to the beer-related queries.
# Concatenate dataframes
df_Beer = pd.concat([df_queries[query] for query in queries[4:]])
df_Beer.reset_index(drop=True, inplace=True)
display(HTML(f"<h4>Unique Categories for 'Beer':</h4>"))
display(pd.Series(df_Beer['Categories'].unique()))
filter_Beer = ['Beer Garden', 'Beer Gardens', 'Beer Garden',
'Bar', 'Bars', 'Bar',
'Pub', 'Pubs', 'Pub',
'Beer Bar', 'Beer Bars', 'Beer Bar',
'Irish Pub', 'Irish Pubs', 'Irish']
display(HTML(f"<h4>Filter Set:</h4>"))
for item in filter_Beer[:-1]: print(item, end=', ')
print(filter_Beer[-1])
filter_Beer = set(filter_Beer)
# FILTERING of 'Beer places'
idx = []
N_prev = len(df_Beer)
for i in range(len(df_Beer)):
    idx.append(filter_Beer & set(df_Beer.loc[i, 'Categories'].split(', ')) != set())
df_Beer = df_Beer[idx]
df_Beer.reset_index(drop=True, inplace=True)
display(HTML(f"<Large>The filtered dataframe 'Beer' now has {len(df_Beer)} entries instead of {N_prev}.</Large>"))
map_Monroe = folium.Map(location = (PointsOfInt['Rochester']['point'][0],
PointsOfInt['Rochester']['point'][1]), zoom_start=11)
BeerL= FeatureGroup(name='Beer')
CafeL = FeatureGroup(name='Cafe')
IceCreamL = FeatureGroup(name='Ice Cream')
for i in range(len(df_Beer)):
    label = folium.Popup(df_Beer.loc[i, 'Name'])
    folium.CircleMarker([df_Beer.loc[i, 'Lat'], df_Beer.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='red',
                        fill_color='orange',
                        fill_opacity=0.9).add_to(BeerL)
for i in range(len(df_Cafe)):
    label = folium.Popup(df_Cafe.loc[i, 'Name'])
    folium.CircleMarker([df_Cafe.loc[i, 'Lat'], df_Cafe.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='black',
                        fill_color='gray',
                        fill_opacity=0.9).add_to(CafeL)
for i in range(len(df_IceCream)):
    label = folium.Popup(df_IceCream.loc[i, 'Name'])
    folium.CircleMarker([df_IceCream.loc[i, 'Lat'], df_IceCream.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill_color='cyan',
                        fill_opacity=0.9).add_to(IceCreamL)
map_Monroe.add_child(BeerL)
map_Monroe.add_child(CafeL)
map_Monroe.add_child(IceCreamL)
map_Monroe.add_child(folium.map.LayerControl())
legend_html = """
<div style="position: fixed;
bottom: 50px; left: 50px; width: 100px; height: 90px;
border: 2px solid grey; z-index: 9999; font-size: 14px;">
<font size="4" style="color:blue">Ice Cream, </font>
<font size="3" style="color:black">Coffee, </font>
<font size="3" style="color:red">Beer. </font>
</div>
"""
map_Monroe.get_root().html.add_child(folium.Element(legend_html))
map_Monroe.save('MapFiles\\map_1.html')
#map_Monroe
By visiting the village of Fairport (town of Perinton), it was originally observed that it has multiple ice cream parlors, coffee or latte shops, breweries, and pubs in close proximity. This led to the hypothesis that the underlying economics of the town can support the different venues that tourists or locals typically enjoy in their leisure time. Hence, one may argue, if one type of venue, e.g., ice cream parlors, is missing in another location or town, maybe there is enough demand there to support a new business of that missing type.
In essence, all the searches tied to the two groups of terms, 'Café' (coffee, latte, café) and 'Beer' (beer, breweries, pubs), serve as surrogates.
Below is a map to check how Foursquare did on Fairport:
map_Monroe = folium.Map(location = (PointsOfInt['Fairport']['point'][0],
PointsOfInt['Fairport']['point'][1]), zoom_start=15)
BeerL= FeatureGroup(name='Beer')
CafeL = FeatureGroup(name='Cafe')
IceCreamL = FeatureGroup(name='Ice Cream')
for i in range(len(df_Beer)):
    label = folium.Popup(df_Beer.loc[i, 'Name'])
    folium.CircleMarker([df_Beer.loc[i, 'Lat'], df_Beer.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='red',
                        fill_color='orange',
                        fill_opacity=0.9).add_to(BeerL)
for i in range(len(df_Cafe)):
    label = folium.Popup(df_Cafe.loc[i, 'Name'])
    folium.CircleMarker([df_Cafe.loc[i, 'Lat'], df_Cafe.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='black',
                        fill_color='gray',
                        fill_opacity=0.9).add_to(CafeL)
for i in range(len(df_IceCream)):
    label = folium.Popup(df_IceCream.loc[i, 'Name'])
    folium.CircleMarker([df_IceCream.loc[i, 'Lat'], df_IceCream.loc[i, 'Lng']],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill_color='cyan',
                        fill_opacity=0.9).add_to(IceCreamL)
map_Monroe.add_child(BeerL)
map_Monroe.add_child(CafeL)
map_Monroe.add_child(IceCreamL)
map_Monroe.add_child(folium.map.LayerControl())
legend_html = """
<div style="position: fixed;
bottom: 50px; left: 50px; width: 100px; height: 90px;
border: 2px solid grey; z-index: 9999; font-size: 14px;">
<font size="4" style="color:blue">Ice Cream, </font>
<font size="3" style="color:black">Coffee, </font>
<font size="3" style="color:red">Beer</font>
<font size="3" style="color:black"> in the village of Fairport. </font>
</div>
"""
map_Monroe.get_root().html.add_child(folium.Element(legend_html))
map_Monroe.save('MapFiles\\map_2.html')
#map_Monroe
As it turns out, Foursquare missed a few sites:
The interpretation is that these missed localities have just not caught on yet, e.g., the missed brewery is fairly new, or that they are simply not that popular. Or that Foursquare is not the optimal tool for this project. Looking for more suitable APIs, different avenues of approaching the project, methodologies, etc., is certainly desirable but, unfortunately, due to resource constraints, out of scope.
However, we can make the argument that if a site shows up on Foursquare it is likely to be popular. Hence, using Foursquare data should yield useful results, the downside being that we may miss some opportunities.
We therefore proceed with the analysis.
The purpose is to examine higher-density areas of venues. For this, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) from scikit-learn is employed.
Essentially the sites are partitioned into core and noise points, the cores representing members of high-density areas. Admittedly, some arbitrary choices are needed. Here we choose eps = 2000 ft and min_samples = 3 for the 'Ice Cream' and 'Beer' groups, and eps = 1000 ft and min_samples = 4 for the 'Café' group.
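As a minimal illustration of how DBSCAN assigns cluster labels and flags noise (the points and parameters below are synthetic, not project data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.],
              [50., 50.]])
labels = DBSCAN(eps=2.0, min_samples=3).fit(X).labels_
# Members of dense regions get labels 0, 1, ...; noise points get -1
print(labels)
```

Note that `min_samples` counts the point itself, and noise points (label -1) are simply dropped in the cluster loops below.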
Before the cluster analysis can take place, the geographic coordinates must be converted into local map or cartographic coordinates with distance units (feet) rather than angles. For this, 'pyproj' was employed with the local reference EPSG:2261.
# Converting latitude and longitude into x, y cartographic units (in feet)
# Correctness was confirmed by computing the Fairport-Rochester distance.
RocProj = pyproj.Proj("+init=EPSG:2261") # central NY map ref
xr, yr = RocProj(PointsOfInt['Rochester']['point'][1],PointsOfInt['Rochester']['point'][0]) # ref coord.
# Ice Cream Places
X_ic = np.empty((len(df_IceCream), 2))
for i, (lon, lat) in enumerate(zip(df_IceCream['Lng'], df_IceCream['Lat'])):
    X_ic[i, 0], X_ic[i, 1] = RocProj(lon, lat)
X_ic -= [xr, yr]
# Coffee Places
X_ca = np.empty((len(df_Cafe), 2))
for i, (lon, lat) in enumerate(zip(df_Cafe['Lng'], df_Cafe['Lat'])):
    X_ca[i, 0], X_ca[i, 1] = RocProj(lon, lat)
X_ca -= [xr, yr]
# Beer Places
X_be = np.empty((len(df_Beer), 2))
for i, (lon, lat) in enumerate(zip(df_Beer['Lng'], df_Beer['Lat'])):
    X_be[i, 0], X_be[i, 1] = RocProj(lon, lat)
X_be -= [xr, yr]
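Note that the `+init=EPSG:2261` syntax used above is deprecated in pyproj 2+. A modern equivalent uses `Transformer`; a sketch (the coordinates are approximately Rochester, NY, and are illustrative only):

```python
from pyproj import Transformer

# WGS84 lon/lat <-> EPSG:2261 (NAD83 / New York Central, US survey feet)
to_xy = Transformer.from_crs("EPSG:4326", "EPSG:2261", always_xy=True)
to_ll = Transformer.from_crs("EPSG:2261", "EPSG:4326", always_xy=True)

x, y = to_xy.transform(-77.61, 43.16)   # (lon, lat) -> (x, y) in feet
lon, lat = to_ll.transform(x, y)        # round trip back to lon/lat
print(round(lon, 4), round(lat, 4))
```

`always_xy=True` forces (lon, lat) axis order, matching how `RocProj(lon, lat)` is called above.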
# Ice Cream
Clust_ic = DBSCAN(eps=2000, min_samples=3).fit(X_ic)
idx_ic = []
xy_ic_mean = []
lonlat_ic = []
for i in range(max(Clust_ic.labels_)+1):
    idx = np.where(Clust_ic.labels_ == i)[0]
    idx_ic.append(idx)
    xy_ic_mean.append(np.mean(X_ic[idx, :], axis=0))
    # the reference offset (xr, yr) must be added back before the inverse projection
    lonlat_ic.append(RocProj(xy_ic_mean[i][0]+xr, xy_ic_mean[i][1]+yr, inverse=True))
# Cafe
Clust_ca = DBSCAN(eps=1000, min_samples=4).fit(X_ca)
idx_ca = []
xy_ca_mean = []
lonlat_ca = []
for i in range(max(Clust_ca.labels_)+1):
    idx = np.where(Clust_ca.labels_ == i)[0]
    idx_ca.append(idx)
    xy_ca_mean.append(np.mean(X_ca[idx, :], axis=0))
    lonlat_ca.append(RocProj(xy_ca_mean[i][0]+xr, xy_ca_mean[i][1]+yr, inverse=True))
# Beer
Clust_be = DBSCAN(eps=2000, min_samples=3).fit(X_be)
idx_be = []
xy_be_mean = []
lonlat_be = []
for i in range(max(Clust_be.labels_)+1):
    idx = np.where(Clust_be.labels_ == i)[0]
    idx_be.append(idx)
    xy_be_mean.append(np.mean(X_be[idx, :], axis=0))
    lonlat_be.append(RocProj(xy_be_mean[i][0]+xr, xy_be_mean[i][1]+yr, inverse=True))
map_Monroe = folium.Map(location = (PointsOfInt['Rochester']['point'][0],
PointsOfInt['Rochester']['point'][1]), zoom_start=11)
BeerL= FeatureGroup(name='Beer')
CafeL = FeatureGroup(name='Cafe')
IceCreamL = FeatureGroup(name='Ice Cream')
for i in range(len(idx_be)):
    for j in idx_be[i]:
        label = folium.Popup(df_Beer.loc[j, 'Name'])
        folium.CircleMarker([df_Beer.loc[j, 'Lat'], df_Beer.loc[j, 'Lng']],
                            radius=5,
                            popup=label,
                            color='red',
                            fill_color='orange',
                            fill_opacity=0.9).add_to(BeerL)
for i in range(len(idx_ca)):
    for j in idx_ca[i]:
        label = folium.Popup(df_Cafe.loc[j, 'Name'])
        folium.CircleMarker([df_Cafe.loc[j, 'Lat'], df_Cafe.loc[j, 'Lng']],
                            radius=5,
                            popup=label,
                            color='black',
                            fill_color='gray',
                            fill_opacity=0.9).add_to(CafeL)
for i in range(len(idx_ic)):
    for j in idx_ic[i]:
        label = folium.Popup(df_IceCream.loc[j, 'Name'])
        folium.CircleMarker([df_IceCream.loc[j, 'Lat'], df_IceCream.loc[j, 'Lng']],
                            radius=5,
                            popup=label,
                            color='blue',
                            fill_color='cyan',
                            fill_opacity=0.9).add_to(IceCreamL)
map_Monroe.add_child(BeerL)
map_Monroe.add_child(CafeL)
map_Monroe.add_child(IceCreamL)
map_Monroe.add_child(folium.map.LayerControl())
legend_html = """
<div style="position: fixed;
bottom: 50px; left: 50px; width: 100px; height: 90px;
border: 2px solid grey; z-index: 9999; font-size: 14px;">
<font size="5" style="color:black">Cluster Analysis ("Noise" points left out): </font><br>
<font size="4" style="color:blue">Ice Cream, </font>
<font size="3" style="color:black">Coffee, </font>
<font size="3" style="color:red">Beer. </font>
</div>
"""
map_Monroe.get_root().html.add_child(folium.Element(legend_html))
map_Monroe.save('MapFiles\\map_3.html')
#map_Monroe
Referring to the cluster analysis, it is not surprising that the largest clusters for 'Beer' and 'Coffee' are in the city of Rochester.
The remaining isolated clusters are small. Some exhibit, in a very broad sense, co-locations (approximate locations):
Unfortunately there are not enough points to examine any (loose) correlations through, e.g., a distance matrix. This idea simply did not pan out. This may, in part, as mentioned earlier, be due to lack of data.
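For completeness, the intended distance-matrix check could be sketched as follows (the cluster centers are invented for illustration; `scipy` is available wherever scikit-learn is installed):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical cluster centers in local (x, y) feet, as produced by the DBSCAN step
ic_centers = np.array([[0., 0.], [5280., 5280.]])      # 'Ice Cream' clusters
be_centers = np.array([[100., 0.], [20000., 20000.]])  # 'Beer' clusters

# Pairwise center-to-center distances in miles; small entries hint at co-location
D = cdist(ic_centers, be_centers) / 5280.0
print(np.round(D, 2))
```

With more clusters per group, thresholding such a matrix would identify which surrogate clusters sit near (or far from) existing ice cream clusters.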
Nevertheless, examining the 'Ice Cream' clusters may yield some insight. The coordinates and associated addresses of the clusters are:
geolocator = Nominatim(user_agent="Monroe_explorer")
df_cluster_IceCream = pd.DataFrame(columns=['Dist. to Rochester [mi]', 'Long.', 'Lat.', 'Address',
                                            'Town', 'Sites in Proximity'])
for i in range(len(lonlat_ic)):
    df_cluster_IceCream.loc[i, 'Dist. to Rochester [mi]'] = np.sqrt(xy_ic_mean[i][0]**2 + xy_ic_mean[i][1]**2)/5280
    df_cluster_IceCream.loc[i, 'Long.'] = lonlat_ic[i][0]
    df_cluster_IceCream.loc[i, 'Lat.'] = lonlat_ic[i][1]
    loc = str(lonlat_ic[i][1]) + ',' + str(lonlat_ic[i][0])
    addrRaw = geolocator.reverse(loc, addressdetails=True)
    addr = addrRaw.address.split(',')
    df_cluster_IceCream.loc[i, 'Address'] = addr[-7] + ', ' + addr[-3] + ', ' + addr[-2]
    df_cluster_IceCream.loc[i, 'Town'] = addr[-5]
    df_cluster_IceCream.loc[i, 'Sites in Proximity'] = addr[0]
df_cluster_IceCream['Dist. to Rochester [mi]'] = df_cluster_IceCream['Dist. to Rochester [mi]'].astype(float)
df_cluster_IceCream['Long.'] = df_cluster_IceCream['Long.'].astype(float)
df_cluster_IceCream['Lat.'] = df_cluster_IceCream['Lat.'].astype(float)
display(HTML(f"<h3>Locations of the 'Ice Cream Clusters'</h3>"))
df_cluster_IceCream.round(2)
Examining these locations, the following was noted:
This could make Pittsford Plaza a potential candidate. It already has a 'Coffee' cluster. Other possible candidates are the town of Pittsford and Ontario Beach; they are close to water. Note that they also have 'Beer' clusters.
Even though these are promising candidates, the underlying data is not strong enough, i.e., a more detailed study with a wider span and local scouting is required, which is out of scope.
The purpose of this study was to find potential new locations for ice cream shops within the greater Rochester area. A good location is one where demand is high but that is not yet crowded with competition that may stifle a budding business.
Different venues were explored with Foursquare. The results were filtered, combined, and partitioned into groups, one of them representing 'ice cream', i.e., places with terms associated with 'ice cream'. The other two groups covered venues that may serve as surrogates or that may correlate with preferred locations for ice cream shops. The groups were subjected to cluster analysis. Unfortunately the data was not strong enough to establish a good enough connection between the surrogate groups and the 'ice cream' group. In part, this was due to not enough available data.
Cluster analysis of the ice cream shops produced six clusters. Looking at them, common themes were found, such as proximity to water or being at or close to shopping plazas, which suggests a higher chance of customer traffic. This prompted a few potential sites for consideration. Even though promising, the data is not strong enough and a more extensive study with a wider scope is recommended.